1 Introduction

It is well known that a 3D object can be recognised irrespective
of pose if a 3D model or a sufficient number of 2D (model) views
are available, together with the correspondence of their feature
points. Under the assumption of orthographic projection and in the
absence of self-occlusions, the theoretical lower limit for the
number of necessary views is two (the 1.5 views theorem; see
Poggio, 1990, and Ullman and Basri, 1991). A view is represented
as a 2N vector $(x_1, y_1, x_2, y_2, \ldots, x_N, y_N)$ of the
coordinates on the image plane of N labeled and visible feature
points on the object. All features are assumed to be visible, as
they are in wire-frame objects (see figures 1 and 2). The
generalisation to opaque objects follows by partitioning the
viewpoint space for each object into a set of “aspects” [5],
corresponding to stable clusters of visible features.

Psychophysical experiments [1] using wire-frame and other objects
suggest that a relatively small number of views - but higher than
two and probably between 20 and 100 - are used by the human visual
system, which seems capable of generalising to novel views by
“interpolating” between the few model views. These experiments are
consistent with a network model proposed by Poggio and Edelman
(1990), in which each hidden unit is similar to a view-centered
neuron tuned to one of the example views (or to prototypical
views), whereas the output can be view-independent if enough
training views are provided.

Often we are able to recognise 3D objects on the sole basis of
their shape after seeing only one view. This is the case for
faces, at least to some extent. It is therefore interesting to ask
in general whether invariance properties of the object may reduce
the number of model views necessary for recognition.

2 Exploiting Bilateral Symmetry for Recognition

Classes of objects with parallel faces and objects with orthogonal
faces, such as most man-made objects, provide interesting examples
of such invariance properties. It can be shown that they are
instances of so-called linear classes of objects [12]. Information
that an object belongs to one of these classes reduces the number
of required model views. A particularly interesting example is the
class associated with the property of bilateral symmetry. It is
easily shown [12] that, given a model view - such as the one in
figure 1a - and prior information that the corresponding 3D object
is bilaterally symmetric, other “virtual” views can be generated by
the appropriate symmetry transformations (see figure 1b). It seems
plausible that these new virtual views contain additional
information that can be exploited for better recognition. In the
special case of orthographic projection with views defined as above
the intuition can be made precise: for any bilaterally symmetric 3D
object, one non-accidental 2D model view is sufficient for
recognition [12]. Notice that in this proof a perfectly frontal
view is an accidental view and is not sufficient by itself for
recognition of novel views. One does not need to know the symmetry
plane but simply the pairs of symmetric point features. Symmetries
of higher order than bilateral allow the recovery of structure from
just one 2D view [12]. Also in the perspective case symmetry is a
useful constraint [4, 7] for recognition.
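
The construction of a virtual view can be checked numerically. The
following is a minimal sketch under stated assumptions, not the
procedure of [12]: the symmetry plane is taken to be x = 0, the
object is a random point set, and all function names are
hypothetical. The sketch relabels each feature with its symmetric
partner (exchanging both image coordinates) and negates x, which is
one reading of the transformation described for figure 1. It then
tests that the result is the orthographic view under the rotation
D R D, with D the reflection about the symmetry plane, and that a
novel view lies in the span of the model and virtual views unless
the model view is frontal.

```python
import numpy as np

def random_rotation(rng):
    """Random proper rotation (det = +1) from a QR decomposition."""
    q, _ = np.linalg.qr(rng.normal(size=(3, 3)))
    if np.linalg.det(q) < 0:      # flip one axis to avoid a reflection
        q[:, 0] = -q[:, 0]
    return q

def symmetric_object(n_pairs, rng):
    """Points symmetric about the plane x = 0; returns points and partner indices."""
    half = rng.normal(size=(n_pairs, 3))
    points = np.vstack([half, half * np.array([-1.0, 1.0, 1.0])])
    partner = np.concatenate([np.arange(n_pairs, 2 * n_pairs), np.arange(n_pairs)])
    return points, partner

def project(points, rotation):
    """Orthographic view: rotate the object, then drop depth. Shape (N, 2)."""
    return (points @ rotation.T)[:, :2]

def virtual_view(view, partner):
    """Relabel each feature with its symmetric partner and negate x."""
    return view[partner] * np.array([-1.0, 1.0])

def in_span(model, virtual, novel):
    """True if the novel x and y coordinate vectors are linear combinations
    of the four coordinate vectors of the model and virtual views."""
    basis = np.hstack([model, virtual])                      # (N, 4)
    coeffs = np.linalg.lstsq(basis, novel, rcond=1e-10)[0]   # (4, 2)
    return bool(np.allclose(basis @ coeffs, novel, atol=1e-8))

rng = np.random.default_rng(0)
points, partner = symmetric_object(4, rng)
D = np.diag([-1.0, 1.0, 1.0])     # reflection about the symmetry plane
R = random_rotation(rng)
model = project(points, R)
virtual = virtual_view(model, partner)
novel = project(points, random_rotation(rng))
frontal = project(points, np.eye(3))

print("virtual view is a legal view:",
      np.allclose(virtual, project(points, D @ R @ D)))
print("novel view in span (generic model view):",
      in_span(model, virtual, novel))
print("novel view in span (frontal model view):",
      in_span(frontal, virtual_view(frontal, partner), novel))
```

For a frontal model view the virtual view coincides with the model
view, so no new coordinate vectors are gained; this is the sense in
which the frontal view is accidental in the sketch.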

3 Psychophysics

While the theoretical results [12] establish a minimum number of
model views needed for recognition of bilaterally symmetric
objects, a practical prediction for the psychophysics of object
recognition is that fewer views should be needed in the case of
symmetric relative to asymmetric objects (see figure 2) for the
same level of generalisation from a single model view. This is a
general prediction, independent of the specific recognition scheme,
and it only assumes that the visual system can exploit the
information contained in bilateral symmetry, which allows virtual
views to be generated from the given ones. It is reasonable to
expect that recognition of symmetric objects is also done in a
suboptimal way, since in the case of non-symmetric objects the
human visual system needs [1, 6] significantly more model views
(20-100) than the theoretical minimum of two (which is valid for
orthographic projection only and, more importantly, for very
specific view features - the x, y coordinates of corresponding
points).

If we consider the interpolation-type or classification models for
visual recognition - such as HBF networks - that are supported by
the psychophysical experiments of Bülthoff and Edelman (1992), we
can make a more specific prediction. For each example view used in
training, the RBF version of the HBF network (see Poggio and
Edelman, 1990) allocates a center, that is, a unit with a
Gaussian-like recognition field around that view. The unit performs
an operation that could be described as “blurred” template matching
by measuring the similarity of the view x to be recognised with the
training view t to which the unit is tuned. The activity of the
unit then depends on this similarity through a Gaussian function
$G(\|x - t\|)$. At the output of the network the activities of the
various units are combined with appropriate weights, found during
the learning stage. In the more general HBF scheme the number of
units, that is, templates, used during recognition may be less than
the number of training views and in addition the appropriate
similarity metric is found automatically during learning (see
Poggio and Girosi, 1990). An example of a recognition field
measured psychophysically for an asymmetric object after training
with a single view is shown in figure 3a. As predicted from the
model (see Poggio and Edelman, 1990), the shape of the surface of
the recognition errors is Gaussian-like (more precisely, a
monotonic transformation of a Gaussian) and is centered around the
training view. In the case of symmetric objects, the prediction is
that the system exploits symmetry by creating from a single
training view additional virtual views and allocating the
corresponding new centers, as shown in figure 1a,b. The expected
overall effect, as measured by the psychophysical technique of
Bülthoff, Edelman and Sklar (1991), would then be a broader,
possibly multi-peaked recognition field.
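
The predicted broadening can be illustrated with a toy version of
the RBF scheme. The sketch below is only an illustration under
stated assumptions and is not the network of Poggio and Edelman
(1990): the object is a random bilaterally symmetric point set,
rotations are restricted to the vertical axis, the Gaussian width
sigma and the unit output weights are arbitrary choices rather than
learned values, and all names are hypothetical. One unit is centered
on the training view; a second unit is centered on the virtual view
obtained by exchanging symmetric partners and negating x.

```python
import numpy as np

def rot_y(deg):
    """Rotation by `deg` degrees about the vertical (y) axis."""
    a = np.radians(deg)
    c, s = np.cos(a), np.sin(a)
    return np.array([[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]])

def view_of(points, deg):
    """Orthographic view, as a 2N vector, of the object rotated by `deg`."""
    return (points @ rot_y(deg).T)[:, :2].ravel()

def virtual_view(view, partner):
    """Exchange symmetric partners and negate x; returns a 2N vector."""
    xy = view.reshape(-1, 2)[partner] * np.array([-1.0, 1.0])
    return xy.ravel()

def rbf_output(view, centers, weights, sigma):
    """Weighted sum of Gaussian units G(||x - t||), one per stored view t."""
    acts = [np.exp(-np.sum((view - t) ** 2) / (2.0 * sigma ** 2)) for t in centers]
    return float(np.dot(weights, acts))

# Random object, symmetric about the plane x = 0 (an assumption of the sketch).
rng = np.random.default_rng(1)
half = rng.normal(size=(5, 3))
points = np.vstack([half, half * np.array([-1.0, 1.0, 1.0])])
partner = np.concatenate([np.arange(5, 10), np.arange(5)])

TRAIN_DEG = -30.0                    # training attitude relative to frontal
train = view_of(points, TRAIN_DEG)
virt = virtual_view(train, partner)  # equals the view at +30 degrees

def recognition_field(centers, offsets_deg, sigma=1.0):
    """Network output for test views rotated away from the training view."""
    weights = np.ones(len(centers))  # assumed weights, not learned ones
    return np.array([rbf_output(view_of(points, TRAIN_DEG + d), centers, weights, sigma)
                     for d in offsets_deg])

offsets = np.arange(-90, 91, 15)
field_one = recognition_field([train], offsets)        # training view only
field_two = recognition_field([train, virt], offsets)  # plus virtual view

for d, f1, f2 in zip(offsets, field_one, field_two):
    print(f"{d:+4d} deg   one center: {f1:.3f}   two centers: {f2:.3f}")
```

With a training attitude of -30 degrees the virtual view sits 60
degrees away from the training view, so the two-center field gains a
second peak there, while the one-center field has a single peak at
the training view.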

Our experimental data are in agreement with both these predictions.
Recognition of novel views given a single training view is
significantly better for symmetric than for asymmetric objects (77%
correct versus 64% correct, averaged over all testing views). In
addition, the recognition field is, as expected, multi-peaked and
elongated (figure 3b) in the correct direction, orthogonal to the
symmetry plane. Figure 4 shows that the broadening of the
generalisation field occurs for symmetric objects exactly in the
direction of the closest virtual view and that by increasing the
distance of the virtual view it is possible to resolve the expected
two peaks.

A remark about the physiological implications of our results is in
order here. Suppose that training to a view of a 3D object creates
a group of neurons tuned to that view. In the case of bilaterally
symmetric objects the virtual views induced by symmetry may
correspond to different neurons specifically tuned to them. A
perhaps more likely alternative is that features with the
appropriate symmetry invariance (see Moses and Ullman, 1991) are
used (instead of the x, y positions of feature points), in which
case the same neurons tuned to the training view would also respond
to the virtual views induced by symmetry.

The key problem in all schemes for learning from examples, such as
RBF networks and various types of neural networks, is the number of
required examples for a given task. Often an insufficient number of
examples are available or obtainable. A case in point is the
recognition of a 3D object, such as a face, from a single training
example (i.e., a model view). An attractive solution to this
general problem is to exploit prior information to generate
additional examples from the few available. We have already shown
that prior information about bilateral symmetry and other
geometrical properties of objects, such as collinearity and edges at
right angles, could be used in theory to do just that [12]. Here we
have provided evidence that the brain seems able to exploit this
type of prior information and seems to do so consistently with a
model of recognition that is based on the memory of the training
views - possibly through neurons tuned to them - and of the virtual
views induced by symmetry.

Several open questions remain. It is natural to speculate that
visual recognition of 3D objects may be the main reason for the
well known sensitivity of our visual system to bilateral symmetry.
How, then, does our visual


Figure 1: Given a single 2D model view (upper left), a virtual view
(upper right) can be generated by an appropriate transformation
induced by the assumption of bilateral symmetry (under orthographic
projection). This transformation exchanges the x coordinates of
bilaterally symmetric pairs of features, and changes their sign
(see Poggio and Vetter, 1992). The operation leads to a virtual
view which is not a simple mirror image (note the labels indicating
corresponding points!) and which is a “legal” view of the 3D
object: the views in the upper left and upper right are images of
the same 3D object appropriately rotated. Other legal views (below
left and right, for instance) can be generated by appropriate
transformations associated with bilateral symmetry: each of these
other views can be obtained, however, as a linear combination of
the two above views. The images at the top left and bottom left can
be interpreted as the image of a (transparent) object seen from two
different viewpoints, simply by exchanging symmetric feature
points. These two interpretations (a and c) are similar to the
bistable perception of the Necker cube type, which therefore
provides an actual and a “virtual” view of a bilaterally symmetric
object.
Figure 2: (a) The model view of a 3D non-symmetric object (center).
The surrounding images show examples of other views (30° rotation
around horizontal or vertical axis) of the same object used for
testing generalization to different view points. In the experiment,
novel views are presented intermixed with distractors, that is,
views of other similar objects (see Bülthoff and Edelman, 1992).
(b) An example of the bilaterally symmetric objects used in our
psychophysical experiments.
Figure 3: The generalization field associated with one training
view of non-symmetric objects (a) (see also Edelman and Bülthoff,
1992) and symmetric objects (b). The recognition performance for
wire-like objects (see figure 2) decreases with distance from the
training view roughly (since the exact nature of the feature space
is unknown) as expected for a Gaussian-like unit tuned to the
training view (a). In (b) the generalization field is multi-peaked
(see figure 1a) and elongated in the horizontal direction as
expected from the presence of additional units tuned to the virtual
views induced by symmetry of the objects. The generalization field
is defined as the recognition rate for views similar to the
training view: means of error rates of 14 subjects and 32 different
objects are plotted vs. rotation in depth around the two axes in
the image plane. The extent of rotation was ±90° in each direction;
the center of the plot corresponds to the training attitude. The
numbers represent the mean percentage of correctly recognized
target objects and correctly rejected distractor objects (Hit +
CR). Target and distractor objects were randomly displayed in equal
proportions.
Figure 4: The graphs show the recognition performance over a (±90°)
rotation range around a fixed axis. The object was presented at 0°.
The data in (a) are taken from figure 3b. In this situation the
virtual views were located at ±90° (thin arrows). In (b) the
virtual views were at 54° (thin arrow) and at -126° (not shown), as
a consequence of a different orientation of the training view. In
both cases, the graph shows peaks at the location of the virtual
views, as predicted.


system detect symmetric pairs of features? Some of the natural
strategies (see for instance Reisfeld, Wolfson and Yeshurun, 1990)
would require extensive and specialised circuitry in the visual
system and neurons specialised in detecting bilaterally symmetric
features such as the virtual lines connecting pairs of bilaterally
symmetric feature points (that are always parallel to each other).
Is it possible to extend our results to geometric constraints other
than bilateral symmetry? Can neurons be found, possibly in IT, with
recognition fields consistent with the psychophysics (figures 3a,b)
and the model? Another important set of questions concerns how to
learn class-specific transformations - for instance the
transformation that “ages” a face - and whether the brain indeed
can learn and use them to effectively generate additional virtual
model views for tasks of recognition.